Chapter 12

Comparing Proportions and Analyzing Cross-Tabulations

IN THIS CHAPTER

Testing for association between categorical variables with the Pearson chi-square and Fisher Exact tests

Estimating sample sizes for tests of association

Suppose that you are studying pain relief in patients with chronic arthritis. Some are taking

nonsteroidal anti-inflammatory drugs (NSAIDs), which are over-the-counter pain medications. But

others are trying cannabidiol (CBD), a new potential natural treatment for arthritis pain. You enroll

100 chronic arthritis patients in your study and you find that 60 participants are using CBD, while the

other 40 are using NSAIDs. You survey them to see if they get adequate pain relief. Then you record

what each participant says (pain relief or no pain relief). Your data file has two dichotomous

categorical variables: the treatment group (CBD or NSAIDs), and the outcome (pain relief or no pain

relief).

You find that 10 of the 40 participants taking NSAIDs reported pain relief, which is 25 percent. But 33

of the 60 taking CBD reported pain relief, which is 55 percent. CBD appears to increase the

percentage of participants experiencing pain relief by 30 percentage points. But can you be sure this

isn’t just a random sampling fluctuation?
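The arithmetic behind these percentages can be checked with a minimal Python sketch. The counts come from the example above; the variable names and the `proportion` helper are illustrative assumptions, not part of any statistics package.

```python
# Counts from the arthritis example in the text.
nsaid_total, nsaid_relief = 40, 10
cbd_total, cbd_relief = 60, 33


def proportion(events, total):
    """Return the fraction of participants who reported the event."""
    return events / total


p_nsaid = proportion(nsaid_relief, nsaid_total)
p_cbd = proportion(cbd_relief, cbd_total)

# Difference between the two groups, expressed in percentage points.
diff_points = (p_cbd - p_nsaid) * 100

print(f"NSAIDs: {p_nsaid:.0%} reported pain relief")
print(f"CBD:    {p_cbd:.0%} reported pain relief")
print(f"Difference: {diff_points:.0f} percentage points")
```

This only describes the observed difference. It says nothing yet about whether a difference of this size could arise from random sampling fluctuation.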

Data from two potentially associated categorical variables is summarized as a cross-tabulation,

which is also called a cross-tab or a two-way table. Because we are studying the association between

two variables, this is a form of bivariate analysis. The rows of the cross-tab represent the different

categories (or levels) of one variable, and the columns represent the different levels of the other

variable. Each cell of the table contains the number of participants with the indicated levels

for the row and column variables. If one variable can be thought of as the “cause” or “predictor” of the

other, the cause variable becomes the rows, and the “outcome” or “effect” variable becomes the

columns. If the cause and outcome variables are both dichotomous, meaning they have only two levels

(as in this example), then the cross-tab has two rows and two columns. This structure has four cells

of counts and is referred to as a 2-by-2 (or 2 × 2) cross-tab, or a fourfold table. Cross-tabs are

displayed with an extra row at the bottom and an extra column at the right to contain the sums

of the cells in the rows and columns of the table. These sums are called marginal totals, or just

marginals.
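As a rough illustration of this layout, the following Python sketch builds the fourfold table for the arthritis example, with treatment as the rows and outcome as the columns, and then adds the marginal totals. The "no relief" counts (27 and 30) are derived by subtracting the reported relief counts from the group sizes; the dictionary structure and names are assumptions made for the example.

```python
# Rows are the "cause" variable (treatment); columns are the outcome.
# No-relief counts are derived from the text: 60 - 33 = 27 and 40 - 10 = 30.
counts = {
    "CBD": {"Relief": 33, "No relief": 27},
    "NSAIDs": {"Relief": 10, "No relief": 30},
}
columns = ["Relief", "No relief"]

# Marginal totals: the sum of each row, the sum of each column, and the grand total.
row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
col_totals = {col: sum(counts[row][col] for row in counts) for col in columns}
grand_total = sum(row_totals.values())

# Print the cross-tab with the extra "Total" row and column.
print(f"{'':<8}" + "".join(f"{c:>11}" for c in columns) + f"{'Total':>8}")
for row in counts:
    cells = "".join(f"{counts[row][c]:>11}" for c in columns)
    print(f"{row:<8}{cells}{row_totals[row]:>8}")
totals = "".join(f"{col_totals[c]:>11}" for c in columns)
print(f"{'Total':<8}{totals}{grand_total:>8}")
```

The row marginals reproduce the group sizes (60 and 40), and the grand total matches the 100 enrolled participants, which is a quick consistency check on the table.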

Comparing proportions based on a fourfold table is the simplest example of testing the association

between two categorical variables. More generally, the variables can have any number of categories,

so the cross-tab can be larger than 2 × 2, with multiple rows and multiple columns. But the basic question

to be answered is always the same: Is the spread of numbers across the columns so different from

one row to the next that the numbers can’t be explained away as random fluctuations? Another way